Abstract. We establish a limit formula for the median of the distance between two
leaves in a fully resolved unrooted phylogenetic tree with n leaves. More precisely, we
prove that this median is equal, in the limit, to $\sqrt{4\ln(2)n}$.

1 Introduction 

The definition and study of metrics for the comparison of phylogenetic trees is a classical 
problem in phylogenetics [1, Ch. 30], motivated, among other applications, by the need 
to compare alternative phylogenies for a given set of organisms obtained from different 
datasets or using different methods. Many metrics for the comparison of rooted or 
unrooted phylogenetic trees on the same set of taxa have been proposed so far. Some of 
the most popular such metrics are based on the comparison of the vectors of distances 
between pairs of taxa in the corresponding trees. But, in contrast with other metrics, 
the statistical properties of these metrics are mostly unknown. 

Steel and Penny [3] computed the mean value of the square of the metric for fully 
resolved unrooted trees defined through the Euclidean distance between their vectors
of distances (they called it the path difference metric). One of the main ingredients 
in their work was the explicit computation of the mean value and the variance of the 
distance d between two leaves in a fully resolved unrooted phylogenetic tree with n 
leaves, obtaining that 

$$\mu(d) = \frac{2^{2(n-2)}}{\binom{2(n-2)}{n-2}}, \qquad \mathrm{Var}(d) = 4n - 6 - \mu(d) - \mu(d)^2.$$
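
As a sanity check on these two formulas, the following Python sketch computes the exact distribution of d for small n and compares its mean and variance with the closed forms. It relies on a standard leaf-insertion argument that is an assumption of this sketch and not part of the paper: every tree with m leaves arises exactly once by attaching leaf m to one of the 2m - 5 edges of a tree with m - 1 leaves, and the distance between leaves 1 and 2 grows by one exactly when the chosen edge lies on the path joining them. All function names are illustrative.

```python
from fractions import Fraction
from math import comb


def distance_counts(n):
    """Return {i: number of trees with n leaves in which d(1, 2) = i}.

    Built by leaf insertion (an assumption of this sketch, not the
    paper's method): a tree with m - 1 leaves has 2m - 5 edges, of
    which d lie on the path between leaves 1 and 2.
    """
    counts = {1: 1}  # n = 2: a single edge joining leaves 1 and 2
    for m in range(3, n + 1):
        edges = 2 * m - 5
        new_counts = {}
        for d, c in counts.items():
            # Attach the new leaf to an edge on the path: distance d + 1.
            new_counts[d + 1] = new_counts.get(d + 1, 0) + d * c
            # Attach it to any other edge: distance unchanged.
            new_counts[d] = new_counts.get(d, 0) + (edges - d) * c
        counts = new_counts
    return counts


def mean_and_variance(n):
    """Exact mean and variance of d over all trees with n leaves."""
    counts = distance_counts(n)
    total = sum(counts.values())
    mean = Fraction(sum(d * c for d, c in counts.items()), total)
    second = Fraction(sum(d * d * c for d, c in counts.items()), total)
    return mean, second - mean ** 2


def steel_penny(n):
    """Closed forms: mu = 2^(2(n-2)) / C(2(n-2), n-2), Var = 4n - 6 - mu - mu^2."""
    mean = Fraction(2 ** (2 * (n - 2)), comb(2 * (n - 2), n - 2))
    return mean, 4 * n - 6 - mean - mean ** 2
```

For instance, with n = 4 both functions return mean 8/3 and variance 2/9.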

In this work we continue the statistical analysis of this random variable d, by giving 
an expression for its median that allows the derivation of a limit formula for it. We 
hope our result will constitute a first step towards obtaining a formula for the median 
of the aforementioned squared path difference metric between fully resolved unrooted 
phylogenetic trees, a problem that still remains open. 



2 Preliminaries 



In this paper, by a phylogenetic tree on a set S we mean a fully resolved (that is, with
all its internal nodes of degree 3) unrooted tree with its leaves bijectively labeled in
the set S. Although in practice S may be any set of taxa, to fix ideas we shall always
take S = {1, ..., n}, with n the number of leaves of the tree, and we shall use the term
phylogenetic tree with n leaves to refer to a phylogenetic tree on this set. For simplicity,
we shall always identify a leaf of a phylogenetic tree with its label.

Let $\mathcal{T}_n$ be the set of (isomorphism classes of) phylogenetic trees with n leaves. It
is well known [1] that $|\mathcal{T}_1| = |\mathcal{T}_2| = 1$ and $|\mathcal{T}_n| = (2n-5)!! = (2n-5)(2n-7)\cdots 3\cdot 1$
for every $n \geq 3$.
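
The count above is easy to evaluate directly. The short Python sketch below (the function name is illustrative, not from the paper) computes the number of phylogenetic trees with n leaves as the product of the odd numbers from 2n - 5 down to 1.

```python
def num_trees(n):
    """Number of fully resolved unrooted trees with n labeled leaves."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if n <= 2:
        return 1  # a single leaf, or a single edge joining two leaves
    result = 1
    # (2n - 5)!! = (2n - 5)(2n - 7) ... 3 * 1
    for odd in range(2 * n - 5, 0, -2):
        result *= odd
    return result
```

The values for n = 3, 4, 5, 6 are 1, 3, 15 and 105.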



3 Main result 

Let $k, l \in S = \{1, \ldots, n\}$ be any two different labels of trees in $\mathcal{T}_n$. The distance $d_T(k, l)$
between the leaves k and l in a phylogenetic tree $T \in \mathcal{T}_n$ is the length (that is, the number of edges) of the unique
path between them. Let us consider the random variable

$d_{kl}$ = distance between the labels k and l in one tree in $\mathcal{T}_n$.

The possible values of $d_{kl}$ are 1, 2, ..., n - 1.

Our goal is to estimate the median, denoted by median(n), of this variable $d_{kl}$ on
$\mathcal{T}_n$ when the tree and the pair of leaves are chosen equiprobably. By symmetry, $d_{kl}$ has the same
distribution as $d_{12}$, and thus the problem reduces to computing the median of the variable $d := d_{12}$.

For every i = 1, ..., n - 1, let $c_i$ be the cardinality of $\{T \in \mathcal{T}_n \mid d_T(1, 2) = i\}$. Arguing
as in [3, p. 140], we have the following result.

Lemma 1. $c_{n-1} = (n-2)!$ and, for every i = 1, ..., n - 2,

$$c_i = (n-2)!\,\frac{(i-1)(n-1)\cdots(2n-i-4)}{(2(n-i-1))!!} = \frac{(i-1)(2n-i-4)!}{(2(n-i-1))!!}.$$

Proof. Consider the function $B(x) = 1 - \sqrt{1-2x}$. By [3, p. 140], we have that

$$c_i = \frac{d^{n-2}}{dx^{n-2}}\Big(B(x)^{i-1}\Big)\Big|_{x=0}.$$

Using that

$$B(x)^{i-1} = x^{i-1} + \frac{i-1}{2}\,x^{i} + \cdots + \frac{(i-1)\,j(j+1)\cdots(2j-i-2)}{(2(j-i))!!}\,x^{j-1} + \cdots,$$

we obtain the formulas in the statement. □
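
The closed formula can be checked numerically against the generating-function description used in the proof. The Python sketch below is only an illustration under stated assumptions: it expands $B(x) = 1 - \sqrt{1-2x}$ as $\sum_{m \geq 1} \frac{(2m-3)!!}{m!} x^m$ (a standard binomial-series fact, with $(-1)!! = 1$), raises the truncated series to the power i - 1, and multiplies the coefficient of $x^{n-2}$ by (n - 2)!. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def double_factorial(m):
    """m!! for integers m >= -1, with 0!! = (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def c_closed(n, i):
    """c_i = (i - 1)(2n - i - 4)! / (2(n - i - 1))!!, for n >= 3, 1 <= i <= n - 1."""
    return (i - 1) * factorial(2 * n - i - 4) // double_factorial(2 * (n - i - 1))


def b_series(order):
    """Coefficients of B(x) = 1 - sqrt(1 - 2x) up to x^order."""
    coeffs = [Fraction(0)]
    for m in range(1, order + 1):
        coeffs.append(Fraction(double_factorial(2 * m - 3), factorial(m)))
    return coeffs


def multiply(p, q, order):
    """Product of two power series, truncated at x^order."""
    out = [Fraction(0)] * (order + 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            if a + b <= order:
                out[a + b] += pa * qb
    return out


def c_from_series(n, i):
    """(n - 2)! times the coefficient of x^(n-2) in B(x)^(i-1)."""
    order = n - 2
    power = [Fraction(1)] + [Fraction(0)] * order  # B(x)^0 = 1
    b = b_series(order)
    for _ in range(i - 1):
        power = multiply(power, b, order)
    return factorial(n - 2) * power[order]
```

For n = 5 both routes give $(c_1, c_2, c_3, c_4) = (0, 3, 6, 6)$, which adds up to $|\mathcal{T}_5| = 15$.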

Lemma 2. For every k = 1, ..., n - 1,

$$\frac{1}{|\mathcal{T}_n|}\sum_{i=1}^{k} c_i = 1 - \frac{2^k\,(n-3)!\,(2n-k-4)!}{2\,(2n-5)!\,(n-k-2)!}.$$

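Proof. Taking into account that $(2j)!! = 2^j j!$ and $(2j+1)!! = \frac{(2j+1)!}{2^j j!}$ for every $j \in \mathbb{N}$,
and using Lemma 1, we have:

$$\begin{aligned}
\frac{1}{|\mathcal{T}_n|}\sum_{i=1}^{k} c_i
&= \frac{1}{(2n-5)!!}\sum_{i=1}^{k}\frac{(i-1)(2n-i-4)!}{(2(n-i-1))!!}
 = \frac{(n-3)!}{4(2n-5)!}\sum_{i=1}^{k}\frac{(i-1)\,2^{i}\,(2n-i-4)!}{(n-i-1)!} \\
&= \frac{(n-3)!}{4(2n-5)!}\sum_{i=1}^{k-1}\frac{i\,2^{i+1}\,(2n-i-5)!}{(n-i-2)!}.
\end{aligned}$$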



We now use the method in [2, Chap. 5] to compute $S_k = \sum_{i=1}^{k-1} \frac{i\,2^{i+1}(2n-i-5)!}{(n-i-2)!}$.
Set $t_i = i\,2^{i+1}(2n-i-5)!/(n-i-2)!$. Then

$$\frac{t_{i+1}}{t_i} = \frac{2(1+i)(2+i-n)}{i(5+i-2n)}.$$

The next step is to find three polynomials a(i), b(i) and c(i) such that

$$\frac{t_{i+1}}{t_i} = \frac{a(i)}{b(i)}\cdot\frac{c(i+1)}{c(i)}.$$

We take a(i) = 2(2 + i - n), b(i) = 5 + i - 2n and c(i) = i. Next, we have to find a
polynomial x(i) such that a(i)x(i + 1) - b(i - 1)x(i) = c(i). The polynomial x(i) = 1
satisfies this equation. Then, by [2, Chap. 5],

$$S_k = \frac{b(k-1)\,x(k)}{c(k)}\,t_k + g(n) = \frac{(4+k-2n)\,2^{k+1}\,(2n-k-5)!}{(n-k-2)!} + g(n),$$

where g is a function of n. We find this function from the case k = 2:

$$\frac{4(2n-6)!}{(n-3)!} = S_2 = \frac{8(6-2n)(2n-7)!}{(n-4)!} + g(n).$$

From this equality we deduce that $g(n) = \frac{4(2n-5)!}{(n-3)!}$. We conclude that:

$$S_k = \sum_{i=1}^{k-1}\frac{i\,2^{i+1}(2n-i-5)!}{(n-i-2)!} = \frac{(4+k-2n)\,2^{k+1}\,(2n-k-5)!}{(n-k-2)!} + \frac{4(2n-5)!}{(n-3)!}.$$

The formula in the statement follows from this expression. □ 
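
The two ingredients of this proof can be checked with exact rational arithmetic. The Python sketch below is a hedged illustration with hypothetical function names: it verifies the telescoping identity behind the summation step, namely that $z_k = (4+k-2n)\,2^{k+1}(2n-k-5)!/(n-k-2)!$ satisfies $z_{k+1} - z_k = t_k$, and then compares the closed form of Lemma 2 with the directly accumulated sum of the $c_i$. For k = n - 1 the factor $1/(n-k-2)!$ is read as 0, which is an assumption of the sketch.

```python
from fractions import Fraction
from math import factorial


def double_factorial(m):
    """m!! for integers m >= 0, with 0!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def c(n, i):
    """c_i from Lemma 1, for n >= 3 and 1 <= i <= n - 1."""
    return (i - 1) * factorial(2 * n - i - 4) // double_factorial(2 * (n - i - 1))


def t(n, i):
    """Summand t_i = i 2^(i+1) (2n-i-5)! / (n-i-2)!, for 1 <= i <= n - 2."""
    return Fraction(i * 2 ** (i + 1) * factorial(2 * n - i - 5), factorial(n - i - 2))


def z(n, k):
    """Antidifference z_k = (4+k-2n) 2^(k+1) (2n-k-5)! / (n-k-2)!."""
    return Fraction((4 + k - 2 * n) * 2 ** (k + 1) * factorial(2 * n - k - 5),
                    factorial(n - k - 2))


def cdf_direct(n, k):
    """(c_1 + ... + c_k) / |T_n| by direct summation."""
    return Fraction(sum(c(n, i) for i in range(1, k + 1)), double_factorial(2 * n - 5))


def cdf_closed(n, k):
    """Right-hand side of Lemma 2; equals 1 at k = n - 1 by convention."""
    if k == n - 1:
        return Fraction(1)
    num = 2 ** k * factorial(n - 3) * factorial(2 * n - k - 4)
    den = 2 * factorial(2 * n - 5) * factorial(n - k - 2)
    return 1 - Fraction(num, den)
```

For example, with n = 4 and k = 2 both `cdf_direct` and `cdf_closed` return 1/3, since exactly one of the three trees has leaves 1 and 2 at distance 2.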

Theorem 1. $\dfrac{\mathrm{median}(n)}{\sqrt{4\ln(2)n}} = 1 + O\big(n^{-1/2}\big)$. In particular, $\lim_{n\to\infty} \dfrac{\mathrm{median}(n)}{\sqrt{4\ln(2)n}} = 1$.

Proof. To simplify the notation, we shall denote median(n) by k. By definition and Lemma 2,

$$k = \max\Big\{k \in \mathbb{N} \;\Big|\; \sum_{i=1}^{k} c_i \leq \frac{|\mathcal{T}_n|}{2}\Big\} = \max\Big\{k \in \mathbb{N} \;\Big|\; \frac{2^k (n-3)!\,(2n-k-4)!}{2(2n-5)!\,(n-k-2)!} \geq \frac{1}{2}\Big\}.$$

Thus, k is the largest integer value such that

$$2^k (n-3)!\,(2n-k-4)! \geq (2n-5)!\,(n-k-2)!.$$
If we simplify this inequality and take logarithms, this condition becomes

$$\sum_{j=3}^{k+1}\ln\Big(\frac{2n-(j+2)}{n-j}\Big) = \sum_{j=3}^{k+1}\ln\Big(\frac{2-(j+2)/n}{1-j/n}\Big) \leq k\ln(2). \qquad (1)$$



Combining the expansion of the function $\ln\big(\frac{2-(j+2)x}{1-jx}\big)$ at x = 0,

$$\ln\Big(\frac{2-(j+2)x}{1-jx}\Big) = \ln(2) + \frac{1}{2}(j-2)x + \frac{1}{8}(j-2)(3j+2)x^2 + O(x^3),$$

with equation (1), we obtain: 

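$$\ln(2) \geq \frac{1}{2n}\sum_{j=3}^{k+1}(j-2) + O\Big(\frac{k^3}{n^2}\Big) = \frac{k(k-1)}{4n} + O\Big(\frac{k^3}{n^2}\Big).$$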

So, the first-order term of the median k is the largest integer value that satisfies
$k^2/(4n) \leq \ln(2)$. Therefore, to first order, the median is the integer closest to $\sqrt{4\ln(2)n}$, from
which the statement follows. □
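
To see how quickly the limit is approached, one can compute the exact median from its definition and compare it with $\sqrt{4\ln(2)n}$. The Python sketch below is a minimal illustration, assuming the closed formula for $c_i$ from Lemma 1 and the definition of the median as the largest k with $c_1 + \cdots + c_k \leq |\mathcal{T}_n|/2$. The function names are hypothetical.

```python
from math import factorial, log, sqrt


def double_factorial(m):
    """m!! for integers m >= 0, with 0!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def c(n, i):
    """c_i from Lemma 1, for n >= 3 and 1 <= i <= n - 1."""
    return (i - 1) * factorial(2 * n - i - 4) // double_factorial(2 * (n - i - 1))


def exact_median(n):
    """Largest k with c_1 + ... + c_k <= |T_n| / 2 (requires n >= 3)."""
    total = double_factorial(2 * n - 5)
    running, best = 0, 0
    for k in range(1, n):
        running += c(n, k)
        if 2 * running <= total:  # integer form of running <= total / 2
            best = k
        else:
            break  # the partial sums are non-decreasing, so we can stop
    return best


def limit_value(n):
    """The first-order approximation sqrt(4 ln(2) n)."""
    return sqrt(4 * log(2) * n)
```

For small n the agreement is poor (the exact median is 2 for n = 4 and n = 5), while for n = 200 the exact median is 23 against a limit value of about 23.55. Calling `exact_median(n) / limit_value(n)` for increasing n shows the ratio approaching 1.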

4 Conclusions 

We have obtained a limit formula for the median of the distance between two leaves
in a fully resolved unrooted phylogenetic tree with n leaves. Our method also allows one to
find more terms of the expansion of the median. For instance, it can be proved that
$\mathrm{median}(n) \approx \sqrt{4n\ln 2} + \big(\tfrac{1}{2} - \ln 2\big)$.

The limit formula obtained in this work can be generalized to the p-percentile
$x_p = \max\{k \in \mathbb{N} \mid \sum_{i=1}^{k} c_i \leq |\mathcal{T}_n|\,p\}$. Indeed, using our method we obtain that

$$x_p \approx \sqrt{-4\ln(1-p)\,n}.$$
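
The percentile formula can be explored numerically in the same spirit. The Python sketch below is an illustration under stated assumptions: it evaluates the cumulative distribution through the closed form of Lemma 2 (reading the value at k = n - 1 as 1), takes p strictly between 0 and 1, and uses hypothetical function names.

```python
from fractions import Fraction
from math import factorial, log, sqrt


def cdf(n, k):
    """P(d <= k) from Lemma 2, for n >= 3 and 1 <= k <= n - 1."""
    if k == n - 1:
        return Fraction(1)  # every tree has d <= n - 1
    num = 2 ** k * factorial(n - 3) * factorial(2 * n - k - 4)
    den = 2 * factorial(2 * n - 5) * factorial(n - k - 2)
    return 1 - Fraction(num, den)


def percentile(n, p):
    """x_p: largest k with P(d <= k) <= p, for 0 < p < 1."""
    best = 0
    for k in range(1, n):
        if cdf(n, k) <= p:
            best = k
        else:
            break  # the cumulative distribution is non-decreasing in k
    return best


def percentile_limit(n, p):
    """The approximation sqrt(-4 ln(1 - p) n)."""
    return sqrt(-4 * log(1 - float(p)) * n)
```

With p = 1/2 this reduces to the median: `percentile(200, Fraction(1, 2))` returns 23, and `percentile_limit(200, 0.5)` coincides with $\sqrt{4\ln(2)\cdot 200}$.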